We are going to talk about this data set.
df <- read.csv("~/Documents/school/info3130/data/fraudclaims.csv")
Practice using Excel.
Answer these questions:

- What was the average claim amount paid in each country? (use subtotals in Excel)
- What percent of claims were denied in each country?
- On a percentage basis, how much more (or less) money in claims was paid in May than in June?
- Which countries pay the most claims?
# I'll answer them in R...
# What was the average claim amount paid in each country?
df_sub <- df[df$Claim_Paid == 1, ]
aggregate(as.numeric(Amt_Paid) ~ Country, df_sub, mean)
## Country as.numeric(Amt_Paid)
## 1 Australia 124.0000
## 2 Canada 263.7143
## 3 France 208.2000
## 4 Germany 127.5714
## 5 Italy 248.1429
## 6 Switzerland 392.6000
## 7 United Kindgom 165.8000
## 8 United States 377.8000
# What percent of claims were denied in each country?
prop.table(table(df$Country[df$Claim_Paid == 0]))
##
## Australia Canada France Germany Italy
## 0.17948718 0.12820513 0.23076923 0.10256410 0.15384615
## Switzerland United Kindgom United States
## 0.10256410 0.02564103 0.07692308
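One thing to note (my own aside, not part of the original analysis): `prop.table(table(...))` above gives each country's share of all denials, not the percent of claims denied within each country. A minimal sketch of the within-country denial rate, on made-up data:

```r
# Toy data standing in for the fraud-claims columns (same column names)
claims <- data.frame(
  Country    = c("Canada", "Canada", "France", "France", "France"),
  Claim_Paid = c(1, 0, 1, 1, 0)
)

# Normalize the Country x Claim_Paid table by row (margin = 1)
rates <- prop.table(table(claims$Country, claims$Claim_Paid), margin = 1)

# Column "0" is the proportion of claims denied within each country
denied <- rates[, "0"]
```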
# On a percentage basis, how much more (or less) money in claims was paid in May than in June?
# Which country pays the most claims?
# Need to be more specific....
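For the May-vs-June question, assuming the data has month and amount columns (the `Month` column name here is hypothetical), the percent difference could be sketched like this on toy numbers:

```r
# Hypothetical toy claims; Month is an assumed column name
claims <- data.frame(
  Month    = c("May", "May", "June"),
  Amt_Paid = c(100, 200, 250)
)

# Total paid per month: May = 300, June = 250
totals <- tapply(claims$Amt_Paid, claims$Month, sum)

# Positive means May paid more than June, on a percentage basis
pct_change <- unname((totals["May"] - totals["June"]) / totals["June"] * 100)
```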
Outline CRISP-DM. What are the processes for analyzing data? You can do a picture or whatever works.
(class discussion about data understanding and asking questions about our data.)
When I don’t have a million variables, I like to look at the data first.
df <- read.csv("./data/HeatingOil.csv")
library(GGally)
ggpairs(df)
# using Excel, first enable the Analysis ToolPak -> Descriptive Statistics
# or, using R, we can just run this:
summary(df)
## Insulation Temperature Heating_Oil Num_Occupants
## Min. : 2.000 Min. :38.00 Min. :114.0 Min. : 1.000
## 1st Qu.: 4.000 1st Qu.:49.00 1st Qu.:148.2 1st Qu.: 2.000
## Median : 6.000 Median :60.00 Median :185.0 Median : 3.000
## Mean : 6.214 Mean :65.08 Mean :197.4 Mean : 3.113
## 3rd Qu.: 9.000 3rd Qu.:81.00 3rd Qu.:253.0 3rd Qu.: 4.000
## Max. :10.000 Max. :90.00 Max. :301.0 Max. :10.000
## Avg_Age Home_Size
## Min. :15.10 Min. :1.000
## 1st Qu.:29.70 1st Qu.:3.000
## Median :42.90 Median :5.000
## Mean :42.71 Mean :4.649
## 3rd Qu.:55.60 3rd Qu.:7.000
## Max. :72.20 Max. :8.000
Homework assignment: in your paper you can pose the questions you would want to ask if you were asked to analyze the data.
Talking about correlation today. Data set will be the Heating Oil example.
oil <- read.csv("~/Documents/school/info3130/data/HeatingOil.csv")
cor(oil)
## Insulation Temperature Heating_Oil Num_Occupants
## Insulation 1.00000000 -0.79369606 0.73609688 -0.01256684
## Temperature -0.79369606 1.00000000 -0.77365974 0.01251864
## Heating_Oil 0.73609688 -0.77365974 1.00000000 -0.04163508
## Num_Occupants -0.01256684 0.01251864 -0.04163508 1.00000000
## Avg_Age 0.64298171 -0.67257949 0.84789052 -0.04803415
## Home_Size 0.20071164 -0.21393926 0.38119082 -0.02253438
## Avg_Age Home_Size
## Insulation 0.64298171 0.20071164
## Temperature -0.67257949 -0.21393926
## Heating_Oil 0.84789052 0.38119082
## Num_Occupants -0.04803415 -0.02253438
## Avg_Age 1.00000000 0.30655725
## Home_Size 0.30655725 1.00000000
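Since Pearson, Spearman, and Kendall all come up later, here is a tiny side-by-side on made-up numbers (my own example): Spearman is just Pearson computed on the ranks, and Kendall counts concordant versus discordant pairs.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)  # two pairs of swapped neighbors

cor(x, y, method = "pearson")   # 0.8
cor(x, y, method = "spearman")  # 0.8 (these values already are their own ranks)
cor(x, y, method = "kendall")   # 0.6 = (8 concordant - 2 discordant) / 10 pairs
```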
Let’s see if I can remember the expected-value form of covariance… Here’s one way I remember it:
\[Cov(X, Y) = E[(X - \mu_x)(Y - \mu_y)]\]
Expanding the product inside the expectation:
\[Cov(X, Y) = E[XY - X\mu_y - Y\mu_x + \mu_x \mu_y] = E(XY) - \mu_y E(X) - \mu_x E(Y) + \mu_x \mu_y\]
Now some of these terms disappear, remembering that \(E(X) = \mu_x\):
\[E(XY) - \mu_y \mu_x - \mu_x \mu_y + \mu_x \mu_y = E(XY) - \mu_x \mu_y\]
This is the covariance. To get the correlation we divide the covariance by \(\sigma_x \sigma_y\).
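A quick numeric sanity check of the derivation (my own addition): computing `E(XY) - E(X)E(Y)` with plain sample means gives the population covariance, which differs from R's `cov()` (which divides by n - 1) by a factor of (n - 1)/n, and dividing the covariance by the standard deviations reproduces `cor()`.

```r
set.seed(42)
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# E(XY) - E(X)E(Y), using sample means (population form, divides by n)
manual_cov <- mean(x * y) - mean(x) * mean(y)

# R's cov() divides by n - 1, so rescale before comparing
all.equal(manual_cov, cov(x, y) * (n - 1) / n)

# correlation = covariance / (sigma_x * sigma_y)
all.equal(cov(x, y) / (sd(x) * sd(y)), cor(x, y))
```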
Back to the example. Just some visualizations.
library(ggplot2)
library(plotly)
gg <- ggplot(data = oil, aes(x = Avg_Age, y = Heating_Oil, color = Insulation))
gg <- gg + geom_point()# + geom_jitter()
# get an interactive plot:
ggplotly(gg)
# following along in chapter 5
df <- read.csv("~/Documents/school/info3130/data/Chapter05DataSet.csv",
               colClasses = "factor")
df <- df[6:12]
df[df == 0] <- NA
library(arules)
rules <- apriori(df, parameter = list(minlen=2, supp=0.2,
conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.2 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 696
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7 item(s), 3483 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [4 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(rules)
## lhs rhs support confidence lift count
## [1] {Hobbies=1} => {Religious=1} 0.2388745 0.7961722 1.901967 832
## [2] {Religious=1} => {Hobbies=1} 0.2388745 0.5706447 1.901967 832
## [3] {Family=1} => {Religious=1} 0.2245191 0.5758468 1.375634 782
## [4] {Religious=1} => {Family=1} 0.2245191 0.5363512 1.375634 782
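To make the support, confidence, and lift columns concrete, here is a by-hand computation on tiny made-up transactions (not the chapter data):

```r
# Four hypothetical transactions over items A and B (1 = item present)
t <- data.frame(A = c(1, 1, 1, 0),
                B = c(1, 1, 0, 1))

supp_AB  <- mean(t$A == 1 & t$B == 1)  # support of {A, B}: 2/4
conf_A_B <- supp_AB / mean(t$A == 1)   # confidence of A => B: (2/4)/(3/4) = 2/3
lift     <- conf_A_B / mean(t$B == 1)  # lift: (2/3)/(3/4) = 8/9, just under 1
```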
df <- read.csv("~/Documents/school/info3130/data/Chapter06DataSet.csv")
km <- kmeans(df[-3], 3) # look at k plot
data.frame(clust_size = km$size, km$centers)
## clust_size Weight Cholesterol
## 1 191 110.4607 125.9791
## 2 185 141.9946 173.2486
## 3 171 182.2632 217.0409
car::spm(df, diagonal = "histogram",
reg.line = NULL, smoother = NULL)
Let’s visualize this a different way (not using principal components). First, plot all the variables.
km <- kmeans(df[,c("Weight", "Cholesterol", "Gender")], centers = 4)
# first plot the data with color overlay:
df$Gender <- ifelse(df$Gender == 0, "female", "male")
gg <- ggplot(data = df, aes(x = Weight, y = Cholesterol, color = Gender))
gg + geom_point()
# like in the book:
df$Kgroup <- as.factor(km$cluster)
gg <- ggplot(data = df, aes(x = Weight, y = Cholesterol, color = Kgroup))
gg + geom_point()
OK fine, looks like it is grouping OK.
Here’s some data.
library(ggplot2)
df <- data.frame(x = rnorm(361), y = rnorm(361))
theta <- seq(0, 360, 1) * (pi / 180)
x <- 6 * cos(theta) + rnorm(361, sd = 0.25)
y <- 6 * sin(theta) + rnorm(361, sd = 0.25)
df <- rbind(df, data.frame(x, y))
ggplot(data = df, aes(x = x, y = y)) + geom_point()
There are two distinct clusters.
K-means gets it wrong.
km <- kmeans(df, centers = 2)
df$cluster <- as.factor(km$cluster)
ggplot(data = df, aes(x = x, y = y, color = cluster)) + geom_point() +
ggtitle("K-means Clustering") + theme(legend.title = element_blank())
Hierarchical clustering gets it right.
d <- dist(df[-3])
hc <- hclust(d = d, method = "single")
memb <- cutree(hc, k = 2)
df$hclust <- as.factor(memb)
ggplot(data = df, aes(x = x, y = y, color = hclust)) + geom_point() +
ggtitle("Hierarchical Clustering") + theme(legend.title = element_blank())
df <- read.csv("~/Documents/school/info3130/data/uncatStudents.csv")
df_ab <- as.data.frame(df[, "Absences"])
names(df_ab) <- "Absences"
mean(df$Absences)
## [1] 2.519618
# we HAVE to consider outliers when using K-means
hist(df$Absences, col = "green", xlab = "Absences", main = "")
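One way to make the outlier point concrete is the \(2 \sigma\) rule from class (my own sketch on simulated counts, not the student data):

```r
set.seed(1)
absences <- c(rpois(100, 2), 25)  # one student with an extreme value

# z-score each value; flag anything beyond 2 standard deviations
z <- (absences - mean(absences)) / sd(absences)
outliers <- absences[abs(z) > 2]
```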
km2 <- kmeans(df_ab, centers = 2)
km2$centers
## Absences
## 1 1.022388
## 2 6.194139
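Since "centroids by hand" is on the test: a centroid is just the per-cluster mean of the assigned points, which we can verify against `kmeans()` (my own check on simulated data):

```r
set.seed(7)
pts <- data.frame(val = c(rnorm(20, mean = 0), rnorm(20, mean = 5)))
km  <- kmeans(pts, centers = 2)

# by hand: average the points assigned to each cluster
hand <- tapply(pts$val, km$cluster, mean)

# matches km$centers (up to cluster ordering)
sort(unname(hand)) - sort(unname(km$centers[, 1]))
```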
# what about 4 clusters?
km4 <- kmeans(df_ab, centers = 4)
km4$centers
## Absences
## 1 9.8571429
## 2 3.6450000
## 3 6.8742515
## 4 0.6423488
How many clusters should there be?
n <- 8
wss <- rep(0, n)
for (i in 1:n){
wss[i] <- sum(kmeans(df[-(1:3)], centers = i)$withinss)
}
plot(wss, type = "b", xlab = "k", ylab = "Within groups SS",
xlim = c(1, n + 0.5))
text(1:n, wss, pos = 4, round(wss,3), cex = 0.65)
The elbow suggests K should probably be 2 (below I plot K = 4 anyway). This is nice since we can visualize the clusters using the first two principal components…
k <- 4
pc <- prcomp(df[-(1:3)])
km <- kmeans(df[, -(1:3)], centers = k)
temp <- data.frame(x = pc$x[, 1], y = pc$x[, 2], z = factor(km$cluster))
ggplot(data = temp, aes(x = x, y = y, color = z)) + geom_point() +
theme(legend.title = element_blank()) + ggtitle(paste("K =", k))
Test coming next week.
He does not do T/F or MC
5 or 6 questions for this one.
4 will be hands-on-the-keyboard, applied-knowledge questions. Topics:

- Descriptive stats: central tendency, dispersion, \(2 \sigma\) (outliers).
- Correlation: Pearson, Spearman, Kendall.
- Association rules. You need to have binary (0/1) variables.
- K-means (today). Centroids by hand.
- CRISP-DM. What are the 6 parts, and in what order?
For the test we will download a Word doc.